Abstract

This study reports on the development and field study of K-TEEM, a web-based assessment 
instrument designed to measure mathematical knowledge for teaching (MKT) at the early 
elementary level. The development process involved alignment with early elementary curriculum 
standards, expert review of items and scoring criteria, cognitive interviews with practicing 
teachers, a field test involving 405 practicing teachers, and data modeling using a Rasch model. 
Several examples of MKT at the early elementary level are provided, and some of the challenges 
and decisions made during the process of item and scale development are discussed. Rasch 
model results indicate good model fit and adequate reliability, and the model accounts for more 
than 75% of the variance in the data. The K-TEEM assessment instrument may fill an important 
gap in the set of tools available to researchers for program evaluation and empirical investigation 
of teacher knowledge. 
Developing an Assessment Instrument to Measure Early Elementary Teachers’ Mathematical Knowledge for Teaching

There is widespread agreement among scholars involved with research in teacher 
education that teachers’ influence on their students’ learning depends upon the teachers’ subject- 
matter knowledge and their ability to draw upon that knowledge in the practice of teaching 
(Borko & Putnam, 1996; Ma, 1999; Moats, 2009; Shulman, 1986). Speaking specifically of 
mathematics, Fennema and Franke (1992) wrote, “some scholars suggest that since one cannot 
teach what one does not know, teachers must have in-depth knowledge not only of the specific 
mathematics they teach, but also of the mathematics their students are to learn in the future” (p. 
147). Consistent with the premise that teachers cannot teach what they do not know, the theory 
of change in most teacher professional development in mathematics and science posits that 
teacher professional development has a direct effect on teacher knowledge, and this direct effect 
results, indirectly, in improvements to classroom instruction and increases in student learning 
(Smith & Banilower, 2006). Guided by this theory of change, many teacher professional 
development programs have made it a primary goal to increase teachers’ subject-matter 
knowledge (Garet, Heppen, Walters, Smith, & Yang, 2016; Sowder, 2007). 

Confirmation of the link between teacher knowledge and student achievement in large- 
scale studies has had limited success, and the extant positive results seem disproportional to the 
firm beliefs and strong rhetoric in the broader literature on teacher education (Carlisle, Kelcey, 
Rowan, & Phelps, 2011; Hill, Ball, Blunk, Goffney, & Rowan, 2007; Hill, Rowan, & Ball, 2005; 
National Mathematics Advisory Panel, 2008; Rockoff, Jacob, Kane, & Staiger, 2011; Yoon, 
Duncan, Lee, Scarloss, & Shapley, 2007). There are several plausible explanations for the 
shortcomings in the evidence. Of course, one explanation could be that teachers’ subject-matter 
knowledge is simply not as important a factor in teaching and learning as scholars believe it to 
be. Another explanation could be our limited ability to identify or measure the facets of 
teacher knowledge that matter. 

It is possible that further development and refinement of assessment instruments designed 
to measure teacher knowledge may result in the creation of those critically important tools 
needed by researchers and evaluators to measure the effect of teacher professional development 
(PD) programs at a large scale and gain insight into those facets of teacher knowledge that are associated with student 
learning. Development of reliable instruments that are valid for large-scale use in rigorously 
designed studies is difficult and resource-intensive (Hill, Sleep, Lewis, & Ball, 2007). If a 
construct of teacher knowledge is ill-defined or poorly aligned to the type of knowledge that is 
most effective in supporting student learning, then a test designed to measure that construct may 
fail to detect the facets of teacher knowledge that matter for student learning. Further, an 
assessment instrument can suffer from limitations due to construct-irrelevance or construct 
underrepresentation (AERA, APA, & NCME, 2014), which could lead to failure to detect the 
association between teacher knowledge and student learning. 

The purpose of the current article is to describe and discuss a method used over the 
course of one year to develop an assessment instrument to measure teacher knowledge. Attempts 
were made to align the content of the assessment instrument with the mathematics early 
elementary teachers are expected to teach and the goals of two mathematics professional 
development programs. We describe an iterative and overlapping process of item writing and 
revision, expert review, use of test items in cognitive interviews with practicing teachers from 
the target population, and data modeling. The process was designed to continually strive toward 
clarification of the construct we were trying to measure and to minimize construct-irrelevant 
variance in the resulting data. Because the resulting assessment instrument is designed 
specifically to measure knowledge for teaching early elementary mathematics, we refer to this 
assessment instrument as the K-TEEM. 

Why focus on knowledge for teaching early elementary mathematics? 

Following decades of research on general pedagogical knowledge needed for teaching, 
Shulman (1986) introduced the construct of pedagogical content knowledge, which he described 
as “…the particular form of content knowledge that embodies the aspects of content most 
germane to its teachability” (p. 9). Elaborating on Shulman’s theory and applying it within 
mathematics, Ball and her colleagues theorized a delineation of multiple facets within the 
domains of content knowledge (CK) and pedagogical content knowledge (PCK) in a construct 
they named Mathematical Knowledge for Teaching (MKT; Ball, Thames, & Phelps, 2008). 

At present, the most well-known and widely used measures of MKT are those derived 
from the item bank developed through the Study of Instructional Improvement (SII) and 
Learning Mathematics for Teaching (LMT) projects (Hill, Schilling, & Ball, 2004; LMT, 2004). 
Arguably, the second most widely known instrument(s) used by program evaluators and 
researchers to measure MKT are the DTAMS scales (Bush, Ronau, Brown, & Myers, 2006; 
Saderholm, Ronau, Brown, & Collins, 2010). Campbell and her colleagues (2014) recently 
developed another instrument designed to measure teachers’ MKT—both CK and PCK. 

While the number of high-quality items and scales that can be used to efficiently 
measure teachers’ knowledge for teaching mathematics is on the rise, the content coverage in 
those existing scales tends to be most relevant to the content expected to be taught by 
upper-elementary and middle-grades teachers. By design, the measures developed by Campbell et al. 
(2014) are aligned with the content in the standards for upper-elementary and middle-grades 
mathematics. In a validity study of the SII/LMT items and scales, the LMT developers identified 
the early-elementary subject matter (i.e., K-2) as an area with a need for further development 
(Hill & Ball, 2004; Seidel & Hill, 2003). 

In our own initial efforts to measure the knowledge of teachers involved in a mathematics 
professional development project, we administered a pretest in Summer 2013 consisting of items 
gathered from the LMT 1 item bank closely related to the professional development that treatment 
teachers would receive. We searched the LMT item bank to select items that met the following 
criteria: (a) the content of the item focuses on the topic of number, operations, or algebraic 
thinking, (b) the numbers presented in the item involve whole numbers and do not involve 
common fractions, decimal fractions, or negative integers, and (c) items involve specific 
numbers and do not require teachers to interpret letters or other symbols as variables. Our search 
yielded 23 items that met these criteria, and all 23 of these items were used to construct a paper- 
and-pencil assessment. 

Intending to use the scale to measure the effects of a professional development program, 
we administered the 23-item scale to more than 200 public school elementary teachers and 
math coaches in a single southeastern state as a pretest in Summer 2013. Using this sample, the 
Cronbach’s alpha-reliability estimate for the 23-item scale was .61. The low reliability estimate 
and limited number of items fitting our content specifications were the impetus for development 
of an instrument designed to be a reliable measure of MKT that would be valid for use with the 
general population of U.S. teachers working with early elementary grades students. 
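
For readers less familiar with the statistic, Cronbach’s alpha for a scale of k items is k/(k - 1) multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The following is a minimal sketch in Python, assuming item scores are coded numerically (e.g., 0 for incorrect and 1 for correct) and stored with one row per respondent. The function name and the toy data are hypothetical and are not drawn from the pretest sample.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Estimate Cronbach's alpha.

    scores: list of respondent rows, each a list of numeric item scores.
    Hypothetical helper for illustration; not the authors' analysis code.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the respondents' total scores
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data (assumed): four respondents, three dichotomous items
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
alpha = cronbach_alpha(toy)
```

Higher values indicate that the items covary more strongly relative to their individual variances, which is why a short scale of loosely related items can yield a low estimate.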

Theoretical Framework 

Hill, Rowan, and Ball (2005) define Mathematical Knowledge for Teaching (MKT) as 
“the mathematical knowledge used to carry out the work of teaching mathematics” (emphasis as 
in original; p. 373). We conceptualize the work of teaching mathematics to include interactions 
with students in the classroom setting as well as activity in related contexts such as planning for 
teaching and reflecting on teaching and learning (Ball, Thames, & Phelps, 2008; Goldsmith, 
Doerr, & Lewis, 2014). Ball, Thames, and Phelps (2008) suggest several sub-domains that 
compose the larger domain of MKT. These sub-domains of MKT include: common content 
knowledge (CCK), specialized content knowledge (SCK), knowledge of content and students 
(KCS), knowledge of content and teaching (KCT), horizon content knowledge (HCK), and 
knowledge of content and curriculum (KCC). 

In designing the K-TEEM scale, we used an iterative process to 1) identify important and 
measurable facets of knowledge for teaching early elementary mathematics, 2) sort these various 
facets of knowledge into the theoretical categories of the existing MKT framework, and 3) write 
items to try to yield insight into whether teachers have these facets of knowledge. The resulting 
K-TEEM instrument includes items that reflect four of the theoretical subdomains of MKT: 
CCK, SCK, KCS, and KCT. HCK involves knowledge of the mathematics topics that students 
will encounter in the future. For early elementary, one of those topics on the horizon might be 
rational numbers (e.g., fractions). We ruled out measuring HCK, because our intent was to focus 
specifically on the topics the students are expected to learn (and teachers are expected to teach). 
Because teachers in different places use different textbooks, and textbooks are continually 
revised, we chose to refrain from focusing on KCC. But we did use the Common Core State Standards 
for Mathematics (CCSS-M; NGACBP & CCSSO, 2010) as a general guideline for delineating 
content and determining what kind of topics are fair to expect most U.S. teachers to know. In the 
following sections, we briefly describe our working definitions of each of these four subdomains 
of MKT and provide original examples of types of knowledge we attempted to measure within 
each of the subdomains. 

Common Content Knowledge 

Ball, Thames, and Phelps (2008) define Common Content Knowledge (CCK) as “the 
mathematical knowledge and skill used in settings other than teaching” (p. 399). For example, 
most people who work with mathematics regularly have mathematical knowledge that allows 
them to solve equations such as 200 - x = 186 in a variety of ways. This knowledge is likely to 
be useful in the act of teaching, but it is not uniquely useful to the work of teaching mathematics. 

Our working definition of CCK also includes knowledge of the formal use of 
mathematical vocabulary and conventions of notation commonly acknowledged in the broader 
mathematics community. For instance, this knowledge might involve an awareness of a 
distinction in meaning between the words expression and equation in mathematics. As another 
example, a person with strong CCK might be expected to recognize the commutative property of 
addition by name or understand why Equation 1, intended to explain how a person might add 34 
and 16, violates generally acknowledged conventions of formal mathematical notation. 

34 + 6 = 40 + 10 = 50        Eq. (1)
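
The violation in Equation 1 can be made concrete with a brief check. The sketch below, written in Python purely for illustration (the variable names are hypothetical), shows that each step of the computation is true when read on its own, whereas the chained statement asserts that 34 + 6 is equal to 40 + 10, which is false.

```python
# Each step of the mental computation, written as a separate equation, is true.
step_one = (34 + 6 == 40)
step_two = (40 + 10 == 50)

# Read formally, "34 + 6 = 40 + 10 = 50" claims that all three expressions
# are equal. Python evaluates the chained comparison the same way:
# (34 + 6 == 40 + 10) and (40 + 10 == 50).
chained = (34 + 6 == 40 + 10 == 50)

print(step_one, step_two, chained)
```

Writing the two steps as separate equations (34 + 6 = 40, then 40 + 10 = 50) preserves the intended reasoning without misusing the equal sign.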

Specialized Content Knowledge 

Ball, Thames, and Phelps (2008) define specialized content knowledge (SCK) as “the 
mathematical knowledge and skill unique to teaching” (p. 400). The authors discuss SCK as a 
way of knowing about mathematics that is uniquely useful in teaching and not necessarily 
required by or useful to persons working in other professions that use mathematics. They offer an 
example of how teachers use “decompressed” knowledge of the mathematics they teach in order 
to efficiently size up the conceptual basis of a student’s error. 
Consistent with our view of teaching as including planning for instruction and 
participating in professional discussions with other teachers, our working conceptualization of 
SCK includes the knowledge of shared, professional vernacular related to the teaching and 
learning of mathematics. For example, we believe that early elementary grades teachers with 
strong SCK are aware of the differences in semantic structure among addition and subtraction 
word problems, various equations that would model the structure of the problem, and terms to 
describe these differences (Carpenter, Fennema, Franke, Empson, & Levi, 1999; Fuson, 1992; 
Nesher, Greeno, & Riley, 1982; Verschaffel, Greer, & DeCorte, 2007). For example, the 
following problem is considered a compare-type problem with the difference unknown; it would 
not be considered to be a change unknown problem. Luca has six trophies. Sofia has four 
trophies. How many more trophies does Luca have than Sofia? The Operations and Algebraic 
Thinking domain in the CCSS-M references taxonomies based on these factors in both first and 
second grade (NGACBP & CCSSO, 2010). Thus, understanding the CCSS-M requires a teacher 
to be aware of these distinctions. 

Whereas this knowledge of professional vernacular may be important in teaching 
children mathematics, professionals who do not teach children mathematics (or study how 
people learn mathematics) are unlikely to know the vernacular or find it useful to know this in 
their own professional work. This is analogous to reading teachers knowing specialized terms 
such as morphemes and phonemes—terms that lay persons don’t need to know in order to be 
able to read sufficiently well. It is analogous to medical doctors using Greek or Latin words to 
describe parts of the body. The layperson need not know the vernacular used by doctors to 
identify body parts, but the professional vernacular enables efficient and precise conversations 
among professionals within the medical community. 
Knowledge of Content and Students 

Knowledge of Content and Students (KCS) is “knowledge that combines knowing about 
students and knowing about mathematics” (Ball, Thames, & Phelps, 2008, p. 401). KCS is the 
amalgamated knowledge about how students think about mathematics that makes it possible for 
teachers to accurately predict or diagnose how students think about and interact with 
mathematics content. For example, Ball et al. (2008) include teachers’ abilities to predict and 
categorize common errors made by learners as examples of KCS. Notice how the ability to 
predict that a given group of second-grade students will make a particular error (a matter of 
KCS) is categorically different from being able to recognize that an error has been made (a 
matter of CCK). 

Teachers with high levels of KCS are able to anticipate the most common ways that 
learners with different levels of understanding will approach problems, and these teachers know 
which of the problems will generally be the easiest and most difficult for students to solve. For 
example, when presented with the equation 10 = 7 + 3 and asked whether the equation is true or 
false, first- and second-grade students typically answer false (Schoen, LaVenia, Champagne, & 
Farina, 2016; Schoen, LaVenia, Champagne, Farina, & Tazaz, 2016). Teachers with strong KCS 
will (a) know that students are likely to answer this question incorrectly and (b) be able to 
explain why a student might think the equation is false. 

As another example, consider the following word problem: Iris had nine flowers. She 
gave some flowers to her mother. Now, she has three flowers. How many flowers did Iris give to 
her mother? A teacher with strong KCS would expect many first- or second-grade students to 
write an equation structured like this: 9 - x = 3. Teachers with low KCS for early elementary 
mathematics are often surprised to see young children use the 9 - x = 3 equation structure to 
model this problem rather than using the way that most adults would think of it, which is to think 
in terms of the equation 9 - 3 = x (T. Carpenter, personal communication, October 2, 2014). 

Knowledge of Content and Teaching 

Ball, Thames, and Phelps (2008) discuss Knowledge of Content and Teaching (KCT) as a 
type of knowledge that “combines knowing about teaching and knowing about mathematics” (p. 
401). KCT is knowledge that facilitates skillful instructional design—the design and sequencing 
of specific mathematics problems and experiences to provoke particular aspects of student 
thinking and accomplish specific instructional goals. Instructional decisions that draw on KCT 
require “…coordination between the mathematics at stake and the instructional options and the 
purposes at play” (Ball, Thames, & Phelps, 2008, p. 401). 

One fundamentally important idea in mathematics that early elementary students are 
expected to learn is the notion of place value (NGACBP & CCSSO, 2010). A typical related task 
in textbooks involves presenting students with a numeral, such as 50, and directing the student to 
circle the numeral in the tens place. If the student circles the 5, the teacher or test developer 
infers that the student has some understanding of place value. If not, the teacher or test developer 
infers that the student does not understand place value. 

Consider the following problem: There are 5 people at Sally’s birthday party. Each 
person eats 10 pieces of candy. How many pieces of candy are eaten? This problem can be 
considered to be a multiplication problem, but there is another important aspect of this problem 
related to place value in a base-ten number system. Teachers with high levels of KCT can 
recognize the grouping-by-tens structure in the word problem, and they can see how this problem 
would be a useful tool for teaching and formative assessment of place value understanding. We 
have observed in our work that not all teachers or school administrators notice the relation 
between the situation in this word problem and opportunities for students to learn about or 
demonstrate their understanding of place value. 

Description of the K-TEEM Test Development Process 

We developed items intended to measure teachers’ MKT in a way that focused on the 
types of knowledge that teachers in the early elementary grades may need to know to teach 
number, operations, and algebraic thinking. Table 1 presents the major phases of the 
development process we used. 

Item Generation 

To define the content focus for the assessment instrument, we started with a close review 
of the CCSS-M (NGACBP & CCSSO, 2010) and the learning goals for teachers in two 
professional development programs: Cognitively Guided Instruction (CGI; Carpenter et al., 
1999; Fennema et al., 1999), and Thinking Mathematics (Bodenhausen et al., 2014). These two 
professional development programs both focus on number and operations—the mainstay of the 
elementary mathematics curriculum. Both programs encourage teachers to use the following 
strategies to guide their instructional decisions: use story problems to introduce mathematical 
concepts, build on students’ existing and intuitive understanding of mathematical ideas, 
emphasize both conceptual and procedural learning, and make continual adjustments to the 
instructional plan based on ongoing formative assessment. Both of these programs are aligned 
with student learning expectations identified in the CCSS-M. 

The development team used the CCSS-M as a touchstone to provide guidelines for 
avoiding overalignment of the instrument to the specific professional development programs 
being evaluated (Slavin & Madden, 2011). We targeted the content found at the intersection of 
the learning goals of the teacher professional development programs and the learning goals for 
students described in the elementary CCSS-M. 

After identifying and naming specific learning goals of the PD programs that might be 
both measurable and consistent with the CCSS-M, we considered how these facets of 
knowledge might map onto the theoretical framework for MKT proposed by Hill, Ball, and 
colleagues (Ball, Thames, & Phelps, 2008; Hill, Rowan, & Ball, 2005). We then established a 
target blueprint for the MKT instrument and drafted items in accordance with this blueprint. 

Items available to us through various existing instruments designed to measure facets of 
MKT were reviewed for inspiration (LMT, 2004; Rittle-Johnson, Matthews, Taylor, & 
McEldoon, 2011; Saderholm, Ronau, Brown, & Collins, 2010; Wheeler, 2010). Items from these 
sources were used with permission from their original authors and modified and adapted for use 
in this new instrument. The K-TEEM scale included one item that was adapted from each of the 
four referenced sources. 

Item types for the K-TEEM included multiple-choice, fill-in-the-blank, and constructed- 
response items. We avoided creating multiple items referencing the same prompt (e.g., item sets, 
testlets) as an attempt to maintain the independence of each of the items in the test. Multiple- 
choice items were designed to include one and only one correct response. Both of those decisions 
were made in support of the goal to simplify scoring and interpretation. The use of all of the 
above, undecided, and none of the above options in multiple-choice items was discouraged. We 
expended considerable effort to write only response options that the practicing teachers would 
consider plausible or that otherwise reflected their thinking (Haladyna, Downing, & Rodriguez, 
2002). The item generation phase occurred over a period of four months of daily effort with a 
team of four item writers. 
Item Refinement 

After the initial drafting of items, we used two activities iteratively to vet and refine the 
item bank: (1) consultation and discussion with experts, and (2) cognitive interviews with early 
elementary teachers. Both of these will be described briefly in the following sections. 

Expert review. First, experienced classroom teachers, teacher professional development 
leaders, and other experts in mathematics and mathematics education reviewed the draft items 
and provided feedback. We specifically elicited feedback on (a) what the experts thought each 
item was measuring, (b) potential issues related to each item’s clarity and validity, (c) what to 
accept as a correct answer for the item, and (d) how difficult the items would be for respondents. 
A major goal at this stage was to identify whether the questions were well-posed and to make 
sure that the determination of correct answers would be acknowledged by all experts in the field 
regardless of their potential differences of opinion (Downing, 2006). 

We decided whether to keep, eliminate, or revise items based on this initial round of 
expert feedback. From the bank of approximately 70 items developed through this process, 55 
items were judged to have valid correct answers and to be aligned with the content of the draft 
test blueprint. These items were then advanced to the next phase of development to be used in 
the cognitive interviews with a small set of early elementary teachers. Organized by MKT 
subdomain and further sub-categories within the subdomains, the numbers of draft items available 
for the cognitive interviews are presented in Table 2. 

Cognitive interviews. Cognitive interviews involve asking respondents to perform tasks 
in the presence of an interviewer and verbalize their thought processes during and after they 
perform the tasks (Desimone & LeFloch, 2004). The interviewer observes the respondent and 
asks questions to further clarify how the respondent was thinking about the items and responses. 
Using data collected through cognitive interviews, we used a critical eye to gain insight 
into whether the items were consistently yielding information about the types of knowledge the 
items were intended to measure. As interviewees revealed reasons for their answers, we learned 
about the aspects of the items that respondents tended to overlook, how they interpreted the 
questions, and how they responded. 

Three of the authors of this paper (who were also developers of the items on the K-TEEM 
test) served as the interviewers for the cognitive interviews. All three had intimate knowledge of 
what each item was designed to measure as well as prior experience with using questioning 
techniques to probe the details of teachers’ thinking about mathematics and mathematics 
teaching and learning. 

In the first round of cognitive interviews, five teachers participated in interviews lasting 
between 90 and 120 minutes. In preparation for the interviews, we set a maximum time limit of 
120 minutes to be considerate of participants’ time. Interviewers were instructed to terminate the 
interview earlier if the interviewer perceived the interviewee to be experiencing significant 
fatigue or frustration. Interviewers noticed that teachers tended to show signs of fatigue in the 
cognitive interviews at 75-90 minutes. These signs included sighs, comments about being tired, 
and flipping through the booklet of questions to see how many questions remained. Sometimes 
the interviewees said directly that they were tired and ready to stop. Along with field notes taken 
by the interviewers, each interview was audiotaped for subsequent review and analysis by the 
item development team. 

After the first round of cognitive interviews, the data generated from the interviews were 
compiled and analyzed. Item by item, the development team carefully examined and compared 
how interviewees responded. Based on these detailed analyses, we gained insight into aspects of 
items that made them easy, difficult, confusing, time-consuming, enjoyable, frustrating, or 
cognitively demanding. Questions about properties of operations, for instance, typically evoked 
feelings of frustration, while the videos of children solving problems seemed to have an 
invigorating effect on the teachers. The interviews also informed the selection and editing of 
response options for individual items, and we used all of this information to revise, eliminate, 
and create new items. 

Using the revised items, we conducted a second round of cognitive interviews with six 
additional elementary teachers. The second round of interviews followed the same process as the 
first, including the sharing of audio recordings and extensive follow-up conversations among the 
development team (i.e., the authors of this paper). Following the cognitive interviews, we 
focused considerable attention on confirming plausible incorrect responses that reflected the 
thinking observed among the target population and limiting the number of response options in 
multiple-choice items accordingly. We sought to limit the amount of time required to read the 
items. We replaced several vocabulary terms in the draft items with synonyms that were used by 
the teachers, and we edited multiple-choice response options to have similar grammatical 
structure, vocabulary, and length (Haladyna, Downing, & Rodriguez, 2002). Above all, items 
were edited and proofread repeatedly. The item refinement phase occurred over a period of four 
months of intensive effort and critical feedback and discussion. 

Field Test 

After the item refinement phase was complete, the bank of remaining items consisted of 
items that had not been eliminated on the grounds that they were too easy, too difficult, or too 
time-consuming, or that they failed to illuminate whether teachers had the type of knowledge or 
ability we sought to measure. Partly influenced by the observations in the cognitive interviews that teachers 
showed clear signs of fatigue after about 75-90 minutes, we aimed to set the number of items 
such that teachers were likely to complete the test in 60 minutes or less. Thus, we set a target to 
keep approximately 35-40 items. To decide which items to keep, we first examined how many 
items remained in each of the test blueprint categories. For the categories with more than three 
items in them, we identified pairs or groups of items with very similar structure that were 
designed to measure the same facet of knowledge. We identified and retained the items from 
these pairs or groups that seemed to be the most effective and efficient at illuminating the 
targeted types of knowledge with teachers in the cognitive interviews, and we removed the 
others. This process yielded 40 items. 

These 40 items were used to create an online version of the instrument using the 
Qualtrics software, a web-based platform that afforded a multimedia approach. Some items 
include image files depicting student work or videos of students solving mathematics problems. 
The test blueprint in Table 3 shows the revised categories of items represented on the K-TEEM 
and the final number of items in each category after the Spring 2014 web-based field test and 
data analysis. Five of the 40 items were removed after data collection, and the reasons are 
discussed in the following section. These steps in the Field Test phase occurred over a period of 
six months, not including the time required to recruit participants. 

Scale Refinement 

The 40 items used in the Spring 2014 field test consisted of 30 multiple-choice items, three 
fill-in-the-blank items, and seven constructed-response items. The multiple-choice items were 
scored in accordance with an a priori determination of correct responses. After the field test data 
were collected, the development team worked as an adjudication committee to examine all of the 
responses to the fill-in-the-blank items to determine the set of possible correct answers to 
those questions. 

We created rubrics to score each of the constructed-response items. These rubrics had been 
drafted based upon the responses observed in the cognitive interviews and were then refined 
through an iterative process of scoring, comparing scores, and refining the scoring criteria. After 
the first draft of the rubrics was created, the members of the development team scored a subset 
of items individually. These scores were then compared, and all discrepancies were discussed 
and resolved by the full group. Two of the seven constructed-response items were dropped from 
the scale during the scoring process due to a combination of difficulty in defining objective 
scoring criteria and shortcomings in achieving a sufficiently high percentage of exact agreement in 
rating. Full consensus on every score was achieved in every case for the remaining items. 

All items on the K-TEEM test were ultimately scored dichotomously (i.e., correct, incorrect), 
and statistics for the remaining 38 individual items were generated using Rasch (1960) models. 
The Rasch model output data identified three items with poor model fit. (See the Results section 
for further discussion.) Those three items were subsequently removed. The final 2014 K-TEEM 
scale includes 35 items involving a mix of multiple-choice (27 items), fill-in-the-blank (2 items), 
and constructed-response (6 items) formats. Scoring, data modeling, and interpretation of results 
occurred over a period of four months. 

Validation Framework 

Kane (2006) provides a useful way of framing test validation in terms of two basic 
components: the interpretation argument and the validity argument. The interpretation argument 
is focused on what a test score means or, put another way, what we can infer that a score tells us 
about the test-taker. The validity argument focuses on how a test is used and whether the use and 
inference thereof is appropriate and defensible. With respect to the process of test development 
and validation, Kane further distinguishes between the development stage and the appraisal stage. 
Our current work is situated in the development stage. As such, we focus most of our attention 
on the interpretation argument while attempting to provide clear direction for the subsequent 
appraisal and validity argument. 

Our current work focuses on building an argument to support the interpretation of the test 
score, while subsequent work will appraise the ability of the K-TEEM to serve its intended use. 

In the previous sections, we defined the domain of interest (i.e., MKT at the early elementary 
level in the domain of number and operations, and equality), offered a test blueprint and other 
test specifications for the K-TEEM, and described the development process including an iterative 
process of subjecting items to expert review and cognitive interviews. In the following sections, 
we will describe a feasibility test and share related findings. 

Description of the Sample and Setting 

The sample for this study includes early elementary grades teachers of mathematics 
(kindergarten through second grade) and instructional support personnel (e.g., math coaches, 
intervention specialists) who signed up to take part in a teacher professional development 
program in mathematics. All of the teachers in the sample worked as teachers in the state of 
Florida. The data for this study were gathered in spring 2014 during the last nine weeks of the 
school year. The web-based questionnaire was administered with participants involved in two 
separate randomized controlled trials of mathematics professional development programs serving 
teachers in early elementary grade levels (n = 405). All of the teachers in both the cognitive 
interviews and the field-test phases were remunerated for their participation. 

Sample 1. Approximately half of the teachers (i.e., Sample 1) completed the online 
instrument while signing up to participate in a randomized controlled trial that would evaluate 
the impact of a 10-day summer workshop based on the Thinking Mathematics program 
(Bodenhausen et al., 2014). Sample 1 data used in this study were collected prior to random 
assignment and delivery of professional development, so the teachers were not aware of the 
condition to which they would be assigned (i.e., treatment, control), and they had not yet participated in 
any professional development offered through the program. The teachers in Sample 1 (n = 206) 
represented 26 school districts, spanning the full geographic range of the state and including 
urban, suburban, and rural areas. Eligibility for enrollment for Sample 1 was constrained to those 
school districts that met the criterion for being high-needs, as defined by a student enrollment at 
or above the level of 50% of students qualifying for free or reduced-price lunch. 

The average number of years of experience among the teachers in Sample 1 was 10.75 
(SD = 7.75) years. The minimum number of years of experience was 1, and the maximum 
number of years of experience was 33. Three of the 206 Sample 1 teachers (1.5%) reported 
having earned a college degree specifically in mathematics or mathematics education. Sample 1 
teachers predominantly identified as female (93.7%). Sample 1 consisted mostly of classroom 
teachers (96%), with only eight participants (4%) identifying with an instructional support role. 

Sample 2. The remaining teachers (i.e., Sample 2) in the 2014 field study were 
completing an end-of-year post-test for the first year of a two-year-long randomized controlled 
trial evaluating a professional development program based on Cognitively Guided Instruction 
(CGI; Carpenter, Fennema, Franke, Levi, & Empson, 1999). Approximately half of the Sample 2 
teachers were in the treatment condition, and the other half were in a practice-as-usual control 
condition. Sample 2 teachers were from two school districts in the same state as the Sample 1 
teachers. One of those school districts was a very large district with urban, suburban, and rural 
areas within it. The other district was a medium-sized school district serving primarily suburban 
areas. Some of the teachers in Sample 1 were from the same two districts as the teachers from 
Sample 2, but none of the individual teachers were included in both samples. 

The average number of years of experience among the teachers in Sample 2 was 11.63 
(SD = 8.84) years. The minimum number of years of experience was 1, and the maximum 
number of years of experience was 48. Three of the 199 Sample 2 teachers (1.5%) reported 
having earned a college degree specifically in mathematics or mathematics education. Sample 2 
teachers predominantly identified as female (98.5%). Sample 2 consisted mostly of classroom 
teachers (91%), with eighteen participants (9%) identifying with an instructional support role. 

Analytic Strategy 

We use the Rasch (1960) model to obtain both the item difficulty and the person ability 
estimates. The joint maximum likelihood estimation of both the person ability and item difficulty 
parameters in the Rasch model provides a distinct advantage over simply assigning a person’s 
ability as the percent of items answered correctly. Utilizing differences in the difficulty levels for 
individual items, the Rasch model is able to differentiate between similar respondent patterns 
occurring at separate points in the scale. By having different spacing between Rasch-based 
scores, we better reflect the true ability differences between people. All analyses for this paper 
were conducted with Winsteps (Linacre, 2016). 
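
For reference, the dichotomous Rasch model is commonly written as shown below. The notation is an assumption introduced here for illustration and does not appear in the original analysis: \theta_n denotes the ability of teacher n, \delta_i denotes the difficulty of item i, and X_{ni} is the scored response (1 = correct, 0 = incorrect). Both parameters are expressed on the same logit scale.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\log \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i
```

Because the log-odds of a correct response depend only on the difference between ability and difficulty, persons and items can be located on a common scale.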

Allowing us to place item difficulty and teacher knowledge on the same scale, the Rasch 
approach is both convenient and easy-to-interpret. It also provides some improvement in 
precision, as precision is maximized in the center of the Rasch score distribution versus in the 
tails for raw score scales (Bond & Fox, 2007). We also chose the Rasch approach because it 
helps us to evaluate whether the underlying construct of teacher mathematics knowledge is 
sufficiently unidimensional. Use of a raw score would make an implicit assumption of 
unidimensionality, whereas Rasch statistics allow us to evaluate the tenability of that assumption. 
This is an important advantage over using raw scores, given that a major contribution of the 
MKT framework is its attention to the broad range of knowledge used in the instruction of mathematics. 

Field Test Findings 

The web-based field test involved 405 practicing teachers who completed the K-TEEM in 
spring 2014. The length of time most participants required to complete the test was between 
30 and 50 minutes. There were a few technical problems, mostly involving participants 
experiencing difficulties logging in to the system or accessing videos embedded into the items. 
These technical problems were resolved in all known cases. 

Item-level Performance Across Samples 

Overall, the respondents in the two samples differed across multiple items in the observed 
probability of a correct response. Table 4 displays descriptive statistics for the overall sample 
and two subsamples. Mean item-level scores in the final set of 35 items for Sample 1 ranged 
from 12% (item 25) to 85% (item 16) of the teachers correctly responding to an item. Less than 
20% of Sample 1 (n = 206) answered correctly on items 14, 25, and 35. On the other hand, more 
than 70% of this sample answered correctly on items 1, 16, 20, and 22. The item-level percent- 
correct responses for Sample 2 (n = 199) ranged from 18% (item 14) to 85% (item 22). Less than 
20% of Sample 2 answered correctly on item 14. Greater than 70% of Sample 2 teachers 
answered correctly on items 1, 4, 9, 12, 16, 20, 22, and 37. 


Rasch Model Fit 

To determine whether the K-TEEM items fit the Rasch model, item fit mean square 
(MNSQ) was examined within and across each sample. Items were considered misfit if the 
MNSQ estimates were either less than 0.6 or greater than 1.4 (Bond & Fox, 2007). Low values 
of MNSQ may be indicative of redundancy with other items, while high values may indicate 
items out of sync with other items in the measure. 
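
The fit criterion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Winsteps implementation: the function names (`rasch_prob`, `item_fit_mnsq`, `flag_misfit`) are hypothetical, the person and item parameters are taken as given rather than estimated, and the demonstration data are simulated. Outfit is computed as the mean of the squared standardized residuals, and infit as the information-weighted version of the same quantity.

```python
import numpy as np

def rasch_prob(theta, delta):
    """Rasch probability of a correct response for every person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))

def item_fit_mnsq(responses, theta, delta):
    """Return (infit, outfit) mean squares for each item.

    responses: persons x items array of 0/1 scores
    theta, delta: person abilities and item difficulties in logits
    """
    p = rasch_prob(theta, delta)
    var = p * (1.0 - p)                # model variance of each response
    sq_resid = (responses - p) ** 2    # squared raw residuals
    outfit = (sq_resid / var).mean(axis=0)          # unweighted mean square
    infit = sq_resid.sum(axis=0) / var.sum(axis=0)  # information-weighted
    return infit, outfit

def flag_misfit(mnsq, low=0.6, high=1.4):
    """Flag items whose MNSQ falls outside the 0.6 to 1.4 range."""
    return (mnsq < low) | (mnsq > high)

# Demonstration with simulated data (hypothetical parameters, not K-TEEM data).
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=500)
delta = np.linspace(-2.0, 2.0, 10)
data = (rng.random((500, 10)) < rasch_prob(theta, delta)).astype(int)
infit, outfit = item_fit_mnsq(data, theta, delta)
print(np.round(infit, 2))
print(flag_misfit(infit))
```

When data are generated by the model itself, as in this demonstration, both statistics are expected to be close to 1.0; values well below 1.0 suggest redundancy, and values well above 1.0 suggest noise.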

Table 5 displays the Infit and Outfit MNSQ values as well as item difficulties and 
discrimination parameters within each group. Across both samples, items 6 and 21 demonstrated 
the worst fit to the model as well as the lowest correlation to the underlying construct. These two 
items were dropped from further analysis. Item 33 was the hardest item in the test 
(difficulty estimate = 2.47) and was eliminated because misaligned item difficulty can negatively affect 
individual person proficiency estimates; removing it also helped to eliminate items with overt guessing and to hone 
the dimensionality of the overall measure (Andrich & Marais, 2014). The remaining K-TEEM 
items had acceptable infit statistics within the 0.6 to 1.4 range for both samples. Given the small 
deviations from expected values in the fit statistics, it is not surprising that the Rasch model 
accounted for a large portion of variance within the measure overall (77.1%), and for Sample 1 
(79.0%) and Sample 2 (74.9%). 

Discrimination Index 

In the process of fitting the sample data to the Rasch model, the beginning steps within 
the analysis assume that all items have the same discrimination (i.e., 1.00) and fit the underlying 
model. Final deviations from this initial expected discrimination, as influenced by the pattern of 
individual responses, provide an indication of fit to the Rasch model. Linacre (2006) suggests a 
range for interpretable discrimination between 0.5 and 2.0, with values greater than 1.0 indicating 
more discriminating items and values less than 1.0 indicating less discriminating items. In other words, items with 
high discrimination values are more likely to be answered correctly by teachers high in MKT than by teachers 
low in MKT. Items with low discrimination values distinguish less between teachers with high and low MKT. 

Based on pooled data from the two subsamples, Table 5 displays the discrimination 
estimate for each item. The highest discrimination values were found for items within the SCK 
and CCK trajectories across both samples and ranged from 1.25 to 1.48. Beyond highly 
discriminating within each sample, these questions demonstrate moderately high correlations to 
the overall measure. 

Low discrimination values are more problematic, as these questions fail to differentiate 
between test takers of different ability levels. Across both samples, all items but one (item 18; 
discrimination estimate = 0.45) were above the lower threshold of 0.5. This item, however, 
demonstrates acceptable fit and was subsequently retained for examination within future data 
collection and analysis. 

Person and Item Reliability and Item Separation 

Person separation reliability measures the degree to which the scale differentiates persons 
on the items. It is calculated by Winsteps as the ratio of the true person variance to the 
observed person variance and ranges from 0 to 1. Values greater than .80 are generally 
considered to indicate adequate reliability. Overall, the person reliability estimate (.75) fell slightly 
below the .80 cut point. The lowest level of person reliability was found for Sample 1 (.66). 
The person reliability for Sample 2 (.79) is only slightly below the preferred 
cutoff of .80, indicating nearly adequate person separation reliability for this group. The analogous 
Cronbach’s alpha across all items and samples was .75 (Sample 1 α = .66; Sample 2 α = .79). 
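
As a point of comparison with the Rasch-based reliability, Cronbach's alpha can be computed directly from a persons-by-items matrix of scored responses. The sketch below is illustrative only: the function name `cronbach_alpha` and the toy response matrix are assumptions introduced here, and the snippet does not reproduce the K-TEEM data or the Winsteps output.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_var_sum = scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Toy example (hypothetical data): six respondents, four dichotomous items.
toy = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(round(cronbach_alpha(toy), 3))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it serves as a rough raw-score analogue of person separation reliability.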

The person separation index estimates the spread of individuals across the measure items 
and is calculated as the adjusted standard deviation divided by the error standard deviation. 
Values above 2.0 are indicative of adequate spread (Bond & Fox, 2007; Linacre, 2005). Less 
separation of persons across items was also seen within Sample 1 (1.40). Overall (1.74) and 
for Sample 2 (1.93), values were closer to, but still lower than, the 2.0 cutoff. 
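
The two definitions given above (reliability as true variance over observed variance, and separation as adjusted standard deviation over error standard deviation) imply a direct relationship between the two statistics: G = sqrt(R / (1 - R)). The sketch below applies that identity to the reported person reliabilities. The function names are hypothetical, and the inputs are the rounded values from the text, so the results agree with the reported separation indices only to within about 0.01.

```python
import math

def separation_from_reliability(reliability):
    """Separation index G implied by a separation reliability R."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))

def reliability_from_separation(separation):
    """Separation reliability R implied by a separation index G."""
    return separation ** 2 / (1.0 + separation ** 2)

# Rounded person reliabilities reported in the text.
for label, r in [("Overall", 0.75), ("Sample 1", 0.66), ("Sample 2", 0.79)]:
    print(f"{label}: R = {r:.2f} -> G = {separation_from_reliability(r):.2f}")

# The 2.0 separation cutoff corresponds to the .80 reliability cutoff.
print(round(reliability_from_separation(2.0), 2))
```

Under this identity, the .80 reliability cutoff and the 2.0 separation cutoff are two expressions of the same criterion.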

Item separation reliability quantifies how well a sample of participants can separate the 
items on the measure (Wright & Stone, 1999). It is calculated by Winsteps by dividing true item 
variance by the observed item variance (Bond & Fox, 2007) and also ranges from 0 to 1. Item 
separation was excellent across the combined group as well as within both Sample 1 and Sample 
2 (.97 to .98). 

Item separation indices indicate how well a sample is able to separate the items into distinct levels 
of difficulty (Wright & Stone, 1999). Item separations were good and suggested 
between 5 and 8 levels of difficulty within the measure. Tables 6 and 7 display the item- and 
person-separation statistics. 

Discussion 

The goal of the work we report in this article was to create an assessment instrument that 
could be used efficiently at a large scale to measure MKT specific to the early elementary level. 
Informed by our review of the CCSS-M and the content of the PD program and by external 
expert reviews of our items and test blueprint, we are confident that the K-TEEM generates an 
interpretable score that corresponds to knowledge for teaching early elementary grades 
mathematics. We think the K-TEEM may be relevant for teachers of early elementary 
mathematics in general, but we think it is especially relevant for use with those teachers who are 
working in environments where the CCSS-M (or similar standards) are a key feature in the 
school accountability system. The current K-TEEM scale focuses on the topics of number, 
operations, and algebraic thinking; it does not currently measure knowledge of other topics such 
as measurement or geometry. 

Through the field test of the K-TEEM, the test administration and scoring procedures 
were determined to impose an acceptable level of burden on both the teachers and the test 
administrators. The person- and item-reliability estimates appear to meet basic standards for 
educational and psychological measurement, although the discrepancy in some of the statistics 
between the two samples warrants further investigation and refinement of the items or scale. 

The Rasch model produces a single score that lends itself well to typical models 
commonly used to investigate the effects of teacher PD interventions, such as multilevel analysis 
of covariance. The Rasch-based score may also be used to investigate associations between 
teachers’ MKT and student learning outcomes—a link that has proven to be elusive in extant, 
large-scale studies. The Rasch model accounted for approximately 75% of the variance in the 
underlying construct of mathematical knowledge for teaching and had excellent to acceptable 
levels of infit and outfit across both samples. 

The assumption of unidimensionality in the Rasch model led us to set aside the 
potential distinctions among facets of knowledge in the subdomains of MKT in favor of defining 
their relationship to an underlying, singular construct. Because of the interrelatedness of the 
various facets of knowledge we were attempting to measure, the Rasch model seemed an 
appropriate analytic method at this stage, and the unidimensionality assumption appears to 
withstand some gentle scrutiny. Future work should examine the strength of the assumption of 
unidimensionality and possible alternative structures within the K-TEEM. 

Many researchers in mathematics education have been reluctant to use rigorous 
evaluation designs (e.g., randomized controlled trials) to measure the effect of educational 
interventions. This reluctance may, in part, be due to the limited availability of assessment 
instruments designed for large-scale studies that can pass muster with researchers in mathematics 
education. Indeed, poor alignment between the substantive content measured by a measurement 
instrument and the focus of an educational program is a major threat to internal validity of the 
results of a study. By design, the content of the K-TEEM aligns with some of the most widely 
acknowledged findings in the corpus of research in early mathematics. By measuring the 
substantive content considered to be important by scholars in mathematics education, perhaps the 
K-TEEM can provide a critically important tool to support the transformation toward rigorous 
evaluation designs becoming more common in mathematics education. 

Methodological Considerations in MKT Item and Scale Development 

In the following sections, we discuss a few particularly important considerations that may 
provide further insight into the content and structure of the K-TEEM. We faced these decisions 
in the development of the K-TEEM, but we think these considerations are generally applicable in 
the development of any assessment instruments intending to measure teacher knowledge. 

Use of context in MKT items. It is common to include scenarios involving teachers and 
students in assessment items designed to measure MKT. This can be one aspect of the items in 
an MKT test that makes it different from other types of mathematical knowledge tests. Three of 
the released items in Appendix A use this technique. The technique is used partly in an attempt to 
demonstrate to the test taker that these questions or problems are relevant to the work they do as 
teachers. Used cleverly, it may also improve the interpretability of the underlying trait or ability 
the test is trying to measure by measuring knowledge or ability in a way that is situated into the 
setting or scenario in which the person might use that particular knowledge. 

Figure 1 contains the original version of Released Item 4 in Appendix A. We originally 
thought the context involving the word problem might provide support for the test takers to think 
about different ways to solve the problem. Through the cognitive interviews, we found that the 
introduction to the problem and the associated word problem within it drew a disproportionate 
amount of the test-taker’s attention and increased the length of time required to complete the 
item. We removed most of the context of this particular item, and the resulting item is Released 
Item 4 in Appendix A. There were many items in which the context was determined to be an 
integral part of the item and contributed to the central idea in the item. In those cases, the context 
was retained. When the context was not necessary to serve the purpose of a given item, the 
context was removed to minimize reading time and cognitive load. 

Mrs. Jones presented the following problem to her first-grade class with intent to discuss the 
ways children might use their knowledge of number facts to help them solve problems. 

Jonah had 6 cars and 8 trucks in his toy vehicle collection. If Jonah displays his cars 
and trucks on a special shelf, how many vehicles will there be on the shelf? 

Describe as many different ways as you can think that a child might solve this problem 
correctly. 

Figure 1. Solve Many Ways item involving an unnecessary scenario. 

Verifying specialized and pedagogical content knowledge. One challenge in measuring 
specialized content knowledge and pedagogical content knowledge using test items scored 
dichotomously (as either correct or incorrect) is in determining types of knowledge that can be 
scored as correct or incorrect objectively and definitively. To protect the integrity of the 
interpretation of the score on the test as being free from bias with respect to certain 
epistemologies or theories of instruction, the items associated with the SCK, KCS, and KCT 
subdomains were required to meet at least one of two criteria: correctness by definition, or 
correctness by substantial empirical evidence. Released item 1 in Appendix A conforms to these 
criteria, while released items 2 and 3 were ultimately judged to be non-conforming to these 
criteria (and were consequently removed from the scale). 

An example of items that are scored for correctness by definition are the items that ask 
teachers to observe a child solving a problem and select the name of the strategy used by the 
student. The names of strategy types and their corresponding definitions are provided in research 
literature published in refereed journals and highly credible summaries of those works (e.g., 
Carpenter et al., 1999; Sarama & Clements, 2009). To address differences in vernacular, a write-in 
option was made available if the available multiple-choice options did not include a correct 
answer in the form the test-taker expected to see it. The write-in responses were subsequently 
reviewed by an adjudication committee to determine which ones were equivalent to the 
predetermined correct answer. For instance, Carpenter et al. (1999) use the term direct modeling to 
mean something very similar to what Sarama and Clements (2009) might call concrete modeling. 
In common teacher vernacular, teachers might call this same idea concrete representation. The 
adjudication process is used to monitor scoring and determine whether these responses might be 
considered to be synonymous (although the latter two were not observed in the field test data). 

For other items, we set an additional necessary condition of having sufficient empirical 
evidence gathered through studies published in refereed sources to support the judgment of 
correctness. One such category of knowledge in the KCS domain is that of relative word problem 
difficulty. For the items in the Relative Problem Difficulty category of knowledge in the KCS 
domain, the veracity of the answers is determined by a synthesis of results in published data 
gathered between 1970 and 2000 and a recent replication of those findings using data gathered in the 
U.S. during 2013-2016 (Schoen, Champagne, & Whitacre, 2015). 

Constructed-response and multiple-response item types. We began the development 
process with the perspective that open-ended items were inherently superior to items presented in 
a multiple-choice format for probing some facets of MKT, such as teachers’ abilities to interpret 
student solutions. Through our experiences in the scoring and cognitive interview processes, our 
confidence in the ability of open-ended items to reliably measure some important aspects of 
knowledge and ability has decreased. We use Released Item 2 in Appendix A as an example. 

With released item 2, we wanted to find out whether teachers would identify that students 
might use a relational thinking approach (Carpenter, Levi, Franke, & Zeringue, 2004) to quickly 
determine the value of the unknown quantity in the equation 46 + 27 = ___ + 26. Some teachers 
offered responses such as, “He added 1 to 46.” This response leaves a lot of ambiguity and 
uncertainty with respect to the question of whether the teacher has insight into relational 
thinking. The item was removed from the final K-TEEM scale out of concerns about having to 
choose between misinterpreting responses or trivializing the scoring procedures. Of course, the 
underlying problem might not be in the item type but in some other aspect such as clarification 
of the construct the item was intended to measure or limited ability of the question itself to yield 
insight into this type of knowledge. 
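
For readers unfamiliar with relational thinking, one way to write the reasoning the item was intended to elicit is sketched below. This decomposition is offered only as an illustration and is not the scoring rubric used for the item. Because 26 is one less than 27, the unknown must be one more than 46, which can be seen without computing either sum.

```latex
46 + 27 = 46 + (1 + 26) = (46 + 1) + 26 = 47 + 26
```

A response such as "He added 1 to 46" is consistent with this reasoning but does not state the compensating relationship between 27 and 26, which is the source of the scoring ambiguity.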

Methodologists in educational and psychological measurement argue that, under certain 
conditions, selected-response (i.e., multiple-choice) items are superior to constructed-response 
items in their ability to reliably measure knowledge and ability (Downing, 2006). For future 
versions of the K-TEEM, we intend to convert some of the constructed-response items into 
multiple-choice items by borrowing frequently observed ideas in the teachers’ written 
responses—both correct and incorrect—to serve as selected-response options. Done well, we 
think this approach can simultaneously improve the measurement qualities of the item and the 
efficiency of the scoring process. While we have become more confident in the potential value 
and usefulness of selected-response items, we think it is critically important to find out how the 
target population actually responds to the questions and to use the respondents’ exact words in 
the response options in order to maximize item reliability and validity for use with the population 
of interest. 

Next Steps 

We think that we have taken important first steps in the development of a valid and 
reliable way to measure MKT at the early elementary level. We also expect further development 
and investigation will provide important insight into both the underlying construct we are trying 
to measure as well as an appraisal of the validity of use for its intended purpose. For instance, 
testing whether the K-TEEM is sufficiently sensitive to detect group differences in MKT will be 
important for validation purposes. 

Other natural next steps will involve multidimensional models (e.g., factor analytic 
models, multidimensional IRT models) and an analysis of differential item functioning (DIF). It 
will be important to perform a DIF analysis to investigate whether items are biased with respect 
to specific PD programs or are invariant with respect to repeated measurement or individual 
characteristics of teachers. DIF analysis and (possible) respecification of the test composition 
based on the results of the DIF analysis could yield insight into the discrepancies in item 
difficulty, model fit, and reliability observed between Sample 1 and Sample 2. An investigation 
of the multidimensionality may yield empirical insights into separation between theorized 
subdomains within the MKT construct. Both of these procedures will require larger samples of 
teachers to provide sufficient statistical power. 

There are surprisingly few large-scale, empirical findings to support the claim of a 
correlation between teacher knowledge and student learning in mathematics or in other subject 
areas. Given the overwhelming agreement among scholars with the claim that the association 
should exist, further investigation of these associations is very important. We do have student 
achievement data for half of the teachers in the current sample, and another particularly 
important next step will involve an investigation of whether teachers’ scores on the K-TEEM 
scale can predict student achievement or learning gains. 

Conclusions 

Developing an assessment instrument to meet standards in educational and psychological 
measurement is a humbling experience that requires tremendous attention, effort, and expertise. 
Ultimately, our goal will be to offer the K-TEEM for the research and evaluation community to 
use. While it is not yet perfect, we think the K-TEEM may fill an important gap in the set of 
tools available to researchers and evaluators for the purpose of investigating the associations 
among teacher MKT, professional development, student learning, and other factors of interest. 

Appendix A 

Released Items from Questionnaire Development Process 

Released Item 1 (Multiple Choice) 

Sequence the three problems that follow from least difficult to most difficult for most 
first graders at the beginning of the year to solve correctly. 


Note: You may assume that the students can have the problems read aloud as many times 
as needed and that they have the option to use paper and pencil or manipulatives. 


Problem A: The candy bowl has 5 peppermint candies and 14 butterscotch candies. How 
many more butterscotch candies are there than peppermint candies? 

Problem B: There were some candies in the bowl. Anna came and put 9 new candies in 
the bowl. Now the bowl has 14 candies. How many candies were in the bowl before Anna 
came? 

Problem C: The candy bowl was filled to the top with 14 candies. Anna grabbed 5 candies 
out of the bowl to share with her friends. How many candies are in the bowl now? 


a. A, C, B 

b. B, C, A 

c. C, A, B 

d. C, B, A 

Correct answer: c 


Released Item 2 (Open-ended Response) 

46 + 27 = ___ + 26 

Mr. Johnson presented this equation to his first grade class. Without writing anything, a 
student quickly called out that the missing number is 47. 

What is the most likely explanation for how the student generated the correct answer so 
quickly? 




Scoring: Credit given if explanation describes use of relational thinking 


Released Item 3 (Multiple Choice) 

Ms. Reynolds believes several of her first grade students are ready to progress from using 
cubes or pictures to represent all of the quantities in Join (or Add To) Result Unknown 
problems to using a counting on strategy. Which of the problems below has numbers that 
are most likely to nudge these students to use counting on instead of modeling all 
quantities with objects or pictures? 

Choose the one best answer: 

a. Jon had 12 stickers in his collection. His grandma gave him 9 more stickers. How 
many stickers does Jon have now? 

b. Jon had 6 stickers in his collection. His grandma gave him 7 more stickers. How 
many stickers does Jon have now? 

c. Jon had 5 stickers in his collection. His grandma gave him 8 more stickers. How 
many stickers does Jon have now? 

d. Jon had 23 stickers in his collection. His grandma gave him 2 more stickers. How 
many stickers does Jon have now? 

Correct answer: d 


Released Item 4 (Open-ended Response) 

Describe as many ways as you can think of that a child might use number fact 
knowledge to correctly find the sum of 6 + 8. 

Please provide a detailed description or notation of the steps in each strategy, using the 
specific numbers from the problem and making clear how the answer is determined. 

Strategy 1: 

Strategy 2: 

Strategy 3: 

Strategy 4: 

Strategy 5: 

Strategy 6: 

Scoring: To receive credit for this item, the response must describe or notate 4 valid and 
distinct strategies for using number fact knowledge to solve 6 + 8. The specific numbers in 
the problem must be included. 
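
For illustration only, the following are some derived-fact strategies a child might use for 6 + 8. The arithmetic is certain, but the labels in parentheses are informal, and which strategies the scoring rubric treats as valid and distinct is not specified here, so this list should not be read as the authors' answer key.

```latex
\begin{align*}
6 + 8 &= 6 + (4 + 4) = (6 + 4) + 4 = 10 + 4 = 14 && \text{(make a ten from 6)} \\
6 + 8 &= (4 + 2) + 8 = 4 + (2 + 8) = 4 + 10 = 14 && \text{(make a ten from 8)} \\
6 + 8 &= (6 + 6) + 2 = 12 + 2 = 14 && \text{(use the double } 6 + 6\text{)} \\
6 + 8 &= (8 + 8) - 2 = 16 - 2 = 14 && \text{(use the double } 8 + 8\text{)} \\
6 + 8 &= (6 + 1) + (8 - 1) = 7 + 7 = 14 && \text{(compensate to } 7 + 7\text{)}
\end{align*}
```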

Notes 

1 The lead author of the current article completed the training and orientation required to use the 
LMT items and scales. 

2 This list of activities and learning goals should not be interpreted as comprehensive. Rather, it is 
intended merely to describe some of the central activities and learning goals in the program. 

Table 1 

Major Phases of Instrument Development 


Phase: Review Existing Instruments 
Duration: 8 months 
Activities: 

• Search for existing instruments aligned with the focus of the intervention being tested 
through review of extant literature and discussions with experts in the field of 
mathematics teacher education and program evaluation 

• Identification and selection of extant and available assessment items aligned with the 
focus of the intervention 

• Field test of the available instruments with a sample from the target population 

• Analysis of resulting field-test data 

Phase: Item Generation 
Duration: 4 months 
Activities: 

• Review of the aspects of MKT relevant to the teacher PD program and the CCSS-M 

• Development of a target blueprint detailing types of items and number of items of 
each type 

• Draft items in accordance with the target blueprint and item specifications 

Phase: Item Refinement 
Duration: 4 months 
Activities: 

• Review of items by experts in mathematics, mathematics education research 
(including teacher education and student thinking), and practicing teachers 

• Conduct cognitive interviews with practicing teachers 

• Discuss notes and observations generated through cognitive interviews with the 
development team 

• Revise or write new items based on cognitive interview findings 

• Determination of the final set of items to be included in the field test 

Phase: Field Test 
Duration: 4 months 
Activities: 

• Transfer of paper-based items to web-based system 

• Pilot testing of web-based system 

• Full-scale field test of web-based system 

Phase: Scale Refinement 
Duration: 3 months 
Activities: 

• Adjudication of responses for fill-in-the-blank items and scoring of responses for 
short answer items 

• Development and analysis of Rasch-based IRT models 

• Final determination of set of items based upon model results 

Note. Some of the activities in the test and item development process occurred in an iterative 
and overlapping fashion. The development period lasted a total of approximately 15 months from 
start to finish. 

Table 2 

Number of Draft Items by MKT Sub-domain and Sub-category at the Start of Cognitive 
Interviews 


Sub-domain and Sub-category                                             Number of Items 

Common Content Knowledge 
  Meaning of the Equal Sign and Related Notation                        7 
  Properties of Operations                                              7 
  Solve Problems in Many Ways                                           4 

Specialized Content Knowledge 
  Evaluating the Validity or Generalizability of Student Strategiesᵃ    4 
  Naming Student Strategies                                             6 
  Naming Word Problem Types                                             5 
  Writing Word Problemsᵇ                                                3 

Knowledge of Content and Students 
  Predicting Student Strategies                                         4 
  Relative Problem Difficulty                                           5 
  Matching Strategies and Problemsᵃ                                     5 

Knowledge of Content and Teaching 
  Selecting Word Problems in Service of Specific Instructional Goals    5 

Total                                                                   55 

Note. 

ᵃ These categories were dropped or reconceptualized based upon the information gathered in 
the cognitive interviews. 

ᵇ This category was dropped due to concerns about whether respondents would use their own 
knowledge versus consult external references in the web-based, self-paced format. 

Table 3 


Blueprint of Items by MKT Sub-domain and Sub-category included in Final Analyses 

Sub-category of Items by Sub-domain of MKT                              Number of Items 

Common Content Knowledge 
  Meaning of the Equal Sign and Related Notation                        5 
  Properties of Operations                                              4 
  Solve Problems in Many Ways                                           2 

Specialized Content Knowledge 
  Interpreting Student Strategies                                       4 
  Naming Student Strategies                                             4 
  Naming Word Problem Types                                             5 

Knowledge of Content and Students 
  Predicting Student Strategies                                         3 
  Relative Problem Difficulty                                           4 

Knowledge of Content and Teaching 
  Selecting Word Problems in Service of Specific Instructional Goals    4 

Total                                                                   35 


Note. There were a total of 40 items on the questionnaire. After the field test data were 
collected, five items were dropped in the process of data analysis. Dropped items had been 
placed in the KCS-Predicting Student Strategies, CCK-Solve Problems in Many Ways, 
SCK-Naming Problem Types, and SCK-Interpreting Student Strategies subcategories. The final 
version of the questionnaire used in the analytic sample has 35 items. 

Table 4 


Average Correctness and Individual Correlation Within Each Sample 


                  Sample 1 (n = 206)  Sample 2 (n = 199) 
Item  Item Codeᵃ  Mean  SD    PB      Mean  SD    PB 

1     KCSRPD6     .73   .446  .22     .81   .390  .25 
2     KCSPS2      .37   .485  .23     .47   .500  .31 
3     CCKES3      .24   .430  .30     .37   .483  .45 
4     KCTLG1      .65   .477  .31     .79   .409  .32 
5     SCKNPT1     .45   .499  .32     .57   .497  .50 
6     KCSRPD1     .63   .484  -       .56   .497  - 
7     KCTLG2      .59   .492  .34     .69   .462  .31 
8     CCKP07      .45   .499  .38     .38   .487  .36 
9     KCSRPD5     .68   .469  .32     .77   .423  .37 
10    CCKES2      .36   .481  .22     .31   .464  .28 
11    KCSPS5      .43   .497  .23     .52   .501  .32 
12    SCKNPT12    .65   .477  .23     .77   .423  .41 
13    KCSPS6      .67   .470  .19     .61   .488  .29 
14    CCKES7      .18   .383  .24     .18   .382  .32 
15    SCKNS3      .29   .455  .14     .38   .487  .33 
16    SCKSMW6     .85   .362  .34     .76   .429  .34 
17    CCKSMW6     .23   .421  .47     .32   .466  .55 
18    KCSRPD3     .49   .501  .17     .57   .496  .20 
19    SCKISS3     .25   .435  .22     .38   .486  .42 
20    CCKES10     .74   .438  .35     .82   .386  .27 
21    CCKP03      .21   .412  -       .24   .426  - 
22    SCKNPT13    .77   .424  .17     .85   .359  .31 
23    SCKISS4     .45   .499  .30     .49   .501  .45 
24    CCKP02      .67   .472  .39     .66   .475  .31 
25    CCKSMW5     .12   .322  .34     .25   .432  .61 
26    KCTLG3      .30   .459  .24     .39   .489  .31 
27    SCKNPT14    .43   .496  .36     .52   .501  .35 
28    SCKNS2      .44   .498  .17     .44   .498  .28 
29    SCKNS6      .41   .493  .39     .49   .501  .35 
30    CCKES5      .57   .496  .29     .66   .474  .27 
31    CCKP06      .40   .491  .34     .42   .495  .23 
32    KCSRPD4     .39   .488  .34     .58   .494  .41 
33    SCKISS1     .07   .256  -       .11   .308  - 
34    CCKP05      .36   .482  .28     .35   .477  .27 
35    KCTLG4      .18   .387  .22     .22   .416  .41 
36    SCKISS2     .50   .501  .38     .59   .493  .44 
37    SCKISS5     .60   .491  .34     .77   .419  .40 
38    SCKNS7      .39   .488  .34     .45   .499  .41 

Note. Mean = proportion of correct responses within the sample for each item; SD = standard 
deviation; PB = individual item correlation to the overall measure. 

ᵃ The item coding scheme involves three fields. The first three letters are the code for the 
subdomain of MKT (e.g., CCK = Common Content Knowledge). The next two or three letters 
correspond to the subcategory of knowledge in that domain (e.g., RPD = Relative Problem 
Difficulty). The numeral at the end simply indexes the items in the item bank in that subdomain. 
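
The descriptive statistics in Table 4 can be illustrated with a short sketch. It assumes that items are scored 0/1, that SD is the sample standard deviation of those scores, and that PB is a point-biserial (Pearson) correlation between the item score and an overall score; the article does not spell out these computational details, and the response data and function names below are hypothetical.

```python
import math

def binary_sd(p, n):
    """Sample SD of a 0/1 item with proportion correct p and n respondents."""
    return math.sqrt(p * (1 - p) * n / (n - 1))

def item_stats(item, totals):
    """Return (mean, sample SD, point-biserial r) for one 0/1 item."""
    n = len(item)
    p = sum(item) / n
    mt = sum(totals) / n
    sd_item = math.sqrt(sum((x - p) ** 2 for x in item) / (n - 1))
    sd_total = math.sqrt(sum((t - mt) ** 2 for t in totals) / (n - 1))
    cov = sum((x - p) * (t - mt) for x, t in zip(item, totals)) / (n - 1)
    return p, sd_item, cov / (sd_item * sd_total)

# Hypothetical data: eight respondents' scores on one item and overall.
item = [1, 1, 0, 1, 0, 1, 1, 0]
totals = [30, 25, 12, 28, 15, 22, 27, 18]
mean, sd, pb = item_stats(item, totals)
print(f"mean={mean:.2f} sd={sd:.3f} pb={pb:.2f}")

# Consistency check against Table 4: item 1, Sample 1 reports Mean = .73
# and SD = .446; the small gap comes from rounding the mean to two places.
print(f"{binary_sd(0.73, 206):.3f}")
```

Under these assumptions, the SD column carries no information beyond the mean and the sample size, which is why items with means near .50 all show SDs near .50.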

Table 5 


Item Difficulty and Fit Statistics Within and Across Both Samples 


                     Sample 1 (n = 206)            Sample 2 (n = 199)            Overall (n = 405) 
Item  Item Code    Item  Infit Outfit    Dis     Item  Infit Outfit    Dis     Item  Infit Outfit    Dis 

1     KCSRPD6     -1.26   1.04   1.06   0.92    -1.51   1.03   1.07   0.97    -1.37   1.03   1.05   0.96 
2     KCSPS2       0.39   1.05   1.05   0.83     0.32   1.07   1.07   0.75     0.36   1.06   1.05   0.80 
3     CCKES3       1.07   0.98   0.97   1.03     0.81   0.94   0.89   1.18     0.93   0.95   0.92   1.11 
4     KCTLG1      -0.88   0.99   0.99   1.03    -1.34   0.97   0.94   1.04    -1.07   0.97   0.95   1.06 
5     SCKNPT1      0.05   0.99   0.99   1.05    -0.15   0.87   0.83   1.48    -0.04   0.93   0.91   1.33 
7     KCTLG2      -0.59   0.97   0.99   1.11    -0.78   1.04   0.99   0.94    -0.67   1.00   0.99   1.02 
8     CCKP07       0.05   0.94   0.95   1.27     0.73   1.03   0.99   0.95     0.37   1.00   1.00   0.99 
9     KCSRPD5     -0.99   0.98   0.97   1.05    -1.21   0.95   0.86   1.10    -1.09   0.96   0.91   1.09 
10    CCKES2       0.46   1.05   1.06   0.83      1.1   1.08   1.12   0.82     0.76   1.09   1.11   0.77 
11    KCSPS5       0.12   1.05   1.06   0.74     0.08   1.06   1.05   0.78     0.10   1.05   1.05   0.77 
12    SCKNPT12    -0.88   1.05   1.04   0.86    -1.21   0.91   0.84   1.16    -1.02   0.98   0.94   1.06 
13    KCSPS6      -0.97   1.07   1.09   0.81    -0.37   1.06   1.12   0.79    -0.69   1.08   1.14   0.74 
14    CCKES7        1.5   0.99   1.03   1.00     1.97   1.01   1.06   0.98     1.72   1.01   1.07   0.97 
15    SCKNS3       0.81   1.09   1.18   0.79     0.73   1.05   1.08   0.85     0.77   1.06   1.12   0.82 
16    SCKSMW6     -2.03   0.93   0.84   1.08    -1.15   0.96   1.07   1.05    -1.59   0.96   1.04   1.05 
17    CCKSMW6      1.16   0.87   0.78   1.22     1.07   0.83   0.77   1.36     1.11   0.85   0.77   1.29 
18    KCSRPD3     -0.11   1.09   1.11   0.47    -0.17   1.16   1.16   0.43    -0.14   1.12   1.13   0.45 
19    SCKISS3      1.02   1.04   1.07   0.93     0.76   0.96   0.97   1.10     0.88   0.99   1.01   1.02 
20    CCKES10     -1.34   0.94   0.93   1.11    -1.55   1.00   0.99   0.99    -1.43   0.97   0.95   1.05 
22    SCKNPT13    -1.48   1.06   1.11   0.89    -1.79   0.96   0.89   1.05    -1.60   1.01   1.00   0.99 
23    SCKISS4      0.05   1.00   1.01   0.99     0.23   0.93   0.92   1.25     0.13   0.97   0.97   1.12 
24    CCKP02      -0.95   0.94   0.91   1.19    -0.59   1.01   1.16   0.90    -0.78   0.97   1.06   1.03 
25    CCKSMW5      2.02   0.93   0.76   1.08     1.47   0.77   0.65   1.35     1.70   0.83   0.68   1.20 
26    KCTLG3       0.76   1.02   1.06   0.93     0.68   1.05   1.13   0.81     0.72   1.03   1.09   0.89 
27    SCKNPT14     0.16   0.96   0.95   1.18     0.08   1.03   0.99   0.91     0.12   0.99   0.97   1.06 
28    SCKNS2       0.07   1.10   1.10   0.53     0.44   1.09   1.13   0.66     0.24   1.10   1.13   0.57 
29    SCKNS6       0.22   0.94   0.93   1.26      0.2   1.03   1.04   0.89     0.21   0.98   0.98   1.08 
30    CCKES5      -0.51   1.00   1.02   0.97    -0.62   1.08   1.04   0.82    -0.56   1.03   1.03   0.89 
31    CCKP06       0.26   0.98   0.97   1.10     0.53   1.15   1.19   0.52     0.39   1.06   1.08   0.77 
32    KCSRPD4      0.33   0.98   0.96   1.10    -0.22   0.96   0.95   1.15     0.07   0.95   0.94   1.21 
34    CCKP05       0.44   1.01   1.01   0.96     0.91   1.10   1.15   0.74     0.66   1.07   1.09   0.80 
35    KCTLG4       1.46   1.02   1.01   0.98     1.64   0.97   0.84   1.07     1.55   1.00   0.95   1.02 
36    SCKISS2     -0.15   0.95   0.94   1.30    -0.25   0.94   0.88   1.25    -0.20   0.94   0.91   1.29 
37    SCKISS5     -0.61   0.97   1.01   1.11    -1.24   0.93   0.80   1.14    -0.88   0.94   0.92   1.16 
38    SCKNS7       0.33   0.97   0.97   1.11     0.39   0.98   0.94   1.10     0.36   0.98   0.96   1.10 

Note. Infit/Outfit reported as MNSQ; Item = Item Difficulty; Dis = Discrimination Index. The first three letters in the item code represent 
the subdomain of MKT (e.g., CCK = Common Content Knowledge). The next two or three letters correspond to the subcategory of 
knowledge in that domain (e.g., RPD = Relative Problem Difficulty). The numeral at the end of the item code simply provides a unique 
identifier for the item within its subdomain and subcategory. See the test blueprint in Table 2 for the full names of the subdomains and 
subcategories. 
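
To make the item difficulty and fit columns in Table 5 more concrete, the sketch below applies the standard dichotomous Rasch model. It assumes the reported difficulties are in logits and uses the conventional mean-square definitions of infit (information-weighted) and outfit (unweighted); the software and estimation settings actually used are not described here, and the person measures and responses in the fit example are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_mnsq(responses, thetas, b):
    """Return (infit, outfit) mean-square statistics for one item."""
    sq_resid = 0.0  # sum of squared residuals
    info = 0.0      # sum of model variances (statistical information)
    z2 = 0.0        # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        var = p * (1 - p)
        sq_resid += (x - p) ** 2
        info += var
        z2 += (x - p) ** 2 / var
    return sq_resid / info, z2 / len(responses)

# Overall difficulties from Table 5, evaluated for a respondent at 0 logits.
p_easy = rasch_p(0.0, -1.37)  # item 1 (KCSRPD6): roughly 0.80
p_hard = rasch_p(0.0, 1.72)   # item 14 (CCKES7): roughly 0.15
print(f"{p_easy:.2f} {p_hard:.2f}")

# Hypothetical respondents whose answers line up perfectly with ability;
# such overly predictable data give mean squares below 1.
thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [0, 0, 1, 1, 1]
infit, outfit = fit_mnsq(responses, thetas, b=0.0)
print(f"infit={infit:.2f} outfit={outfit:.2f}")
```

Mean squares near 1.00, as reported for most items in Table 5, indicate responses about as variable as the model expects; values well above 1 indicate noisier responses, and values well below 1 indicate more predictable ones.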

Table 6 

Person Separation Statistics 


Sample    Avg. Measure  Infit (SE)  Outfit (SE)  Adjusted SD  RMSE  Separation  Reliability 

Overall   .00           1.00 (.14)  1.00 (.23)   .70          .40   1.74        .75 
Sample 1  -.17          1.00 (.14)  1.00 (.20)   .55          .39   1.40        .66 
Sample 2  .43           1.00 (.14)  1.01 (.34)   .79          .41   1.93        .79 


Note. Overall n = 405; Sample 1 n = 206; Sample 2 n = 199 


Table 7 

Item Separation Statistics 


Sample    Avg. Measure  Infit (SE)  Outfit (SE)  Adjusted SD  RMSE  Separation  Reliability 

Overall   .00           1.00 (.06)  1.00 (.10)   .90          .11   7.87        .98 
Sample 1  .00           1.00 (.05)  1.00 (.09)   .90          .16   5.67        .97 
Sample 2  .00           1.00 (.08)  0.99 (.13)   .95          .17   5.61        .97 


Note. Overall n = 405; Sample 1 n = 206; Sample 2 n = 199 
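
The Separation and Reliability columns in Tables 6 and 7 are related to the Adjusted SD and RMSE columns. The sketch below assumes the conventional Rasch definitions (separation G = adjusted SD / RMSE, reliability = G² / (1 + G²)); the article does not state the formulas, and the function names are illustrative. Small differences from the reported separations are expected because the tabled SD and RMSE values are rounded.

```python
def separation(adjusted_sd, rmse):
    """Separation index: error-adjusted spread in units of measurement error."""
    return adjusted_sd / rmse

def reliability(sep):
    """Separation reliability implied by a separation index."""
    return sep ** 2 / (1 + sep ** 2)

# (adjusted SD, RMSE, reported separation, reported reliability)
rows = {
    "Person, Overall":  (0.70, 0.40, 1.74, 0.75),
    "Person, Sample 1": (0.55, 0.39, 1.40, 0.66),
    "Person, Sample 2": (0.79, 0.41, 1.93, 0.79),
    "Item, Overall":    (0.90, 0.11, 7.87, 0.98),
    "Item, Sample 1":   (0.90, 0.16, 5.67, 0.97),
    "Item, Sample 2":   (0.95, 0.17, 5.61, 0.97),
}

for name, (adj_sd, rmse, rep_sep, rep_rel) in rows.items():
    print(
        f"{name:<17} G from SD/RMSE = {separation(adj_sd, rmse):.2f} "
        f"(reported {rep_sep:.2f}); "
        f"reliability from reported G = {reliability(rep_sep):.2f} "
        f"(reported {rep_rel:.2f})"
    )
```

Under these assumptions, the reliability implied by each reported separation matches the tabled reliability to two decimal places.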